A hacker or attacker's purpose is to obtain the victim's personal information or user credentials, or to install malware on the victim's device. Researchers have proposed a number of strategies to counter this threat, and machine-learning-based detection outperforms the alternatives. This report proposes detecting malicious URLs by examining their lexical features. A model is trained, tested, and evaluated with seven different machine learning algorithms; in terms of accuracy, the Random Forest classifier outperforms the other classifiers.
#basic dependencies
import numpy as np #linear algebra
import pandas as pd #data processing
import math
import os
import sys
import re
#sklearn dependencies
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
#algorithms
from sklearn.ensemble import RandomForestClassifier #random forest
from sklearn.linear_model import LogisticRegression #logistic regression
from sklearn.tree import DecisionTreeClassifier #decision tree
from sklearn.neural_network import MLPClassifier #multilayer perceptron
from sklearn.naive_bayes import GaussianNB #naive bayes
from sklearn.ensemble import GradientBoostingClassifier #gradient boosting
from sklearn.linear_model import SGDClassifier #Stochastic Gradient Descent
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
#plotting dependencies
import seaborn as sns
import matplotlib.pyplot as plt
#to remove the warning messages
import warnings
warnings.simplefilter('ignore')
This data was retrieved from Kaggle.com. It contains 651,191 URLs: 428,103 benign (safe) URLs, 96,457 defacement URLs, 94,111 phishing URLs, and 32,520 malware URLs.
#initialize dataframe
#read the csv file using pandas
data = pd.read_csv('C:/Users/YVONNE.C/Desktop/Final Project/malicious_phish.csv')
data.head(20)
|   | url | type |
|---|---|---|
| 0 | br-icloud.com.br | phishing |
| 1 | mp3raid.com/music/krizz_kaliko.html | benign |
| 2 | bopsecrets.org/rexroth/cr/1.htm | benign |
| 3 | http://www.garage-pirenne.be/index.php?option=... | defacement |
| 4 | http://adventure-nicaragua.net/index.php?optio... | defacement |
| 5 | http://buzzfil.net/m/show-art/ils-etaient-loin... | benign |
| 6 | espn.go.com/nba/player/_/id/3457/brandon-rush | benign |
| 7 | yourbittorrent.com/?q=anthony-hamilton-soulife | benign |
| 8 | http://www.pashminaonline.com/pure-pashminas | defacement |
| 9 | allmusic.com/album/crazy-from-the-heat-r16990 | benign |
| 10 | corporationwiki.com/Ohio/Columbus/frank-s-bens... | benign |
| 11 | http://www.ikenmijnkunst.nl/index.php/expositi... | defacement |
| 12 | myspace.com/video/vid/30602581 | benign |
| 13 | http://www.lebensmittel-ueberwachung.de/index.... | defacement |
| 14 | http://www.szabadmunkaero.hu/cimoldal.html?sta... | defacement |
| 15 | http://larcadelcarnevale.com/catalogo/palloncini | defacement |
| 16 | quickfacts.census.gov/qfd/maps/iowa_map.html | benign |
| 17 | nugget.ca/ArticleDisplay.aspx?archive=true&e=1... | benign |
| 18 | uk.linkedin.com/pub/steve-rubenstein/8/718/755 | benign |
| 19 | http://www.vnic.co/khach-hang.html | defacement |
#display the data
print(data)
                                                      url        type
0                                        br-icloud.com.br    phishing
1                     mp3raid.com/music/krizz_kaliko.html      benign
2                         bopsecrets.org/rexroth/cr/1.htm      benign
3       http://www.garage-pirenne.be/index.php?option=...  defacement
4       http://adventure-nicaragua.net/index.php?optio...  defacement
...                                                   ...         ...
651186            xbox360.ign.com/objects/850/850402.html    phishing
651187       games.teamxbox.com/xbox-360/1860/Dead-Space/    phishing
651188         www.gamespot.com/xbox360/action/deadspace/    phishing
651189      en.wikipedia.org/wiki/Dead_Space_(video_game)    phishing
651190          www.angelfire.com/goth/devilmaycrytonite/    phishing

[651191 rows x 2 columns]
#The number of samples present in the data
#Information about the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 651191 entries, 0 to 651190
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   url     651191 non-null  object
 1   type    651191 non-null  object
dtypes: object(2)
memory usage: 9.9+ MB
The dataframe has 651191 rows and 2 columns
#check to see if there is null value
data.isnull().sum()
url     0
type    0
dtype: int64
There are no null values in any of the columns. All the columns are of type object.
#There are 4 types of URL
#only benign URLs are safe; the rest are not
count = data.type.value_counts()
count
benign        428103
defacement     96457
phishing       94111
malware        32520
Name: type, dtype: int64
data.describe()
|   | url | type |
|---|---|---|
| count | 651191 | 651191 |
| unique | 641119 | 4 |
| top | http://style.org.hc360.com/css/detail/mysite/s... | benign |
| freq | 180 | 428103 |
#use a "dummy final year project URLs" link as an example
from IPython.display import Image
Image(url = "url_pic.png", width = 900, height = 800)
sns.barplot(x=count.index, y=count, palette="Set2")
plt.xlabel('Types')
plt.ylabel('Count')
print("Percent Of Benign URLs:{:.2f} %".format(len(data[data['type']=='benign'])/len(data['type'])*100))
print("Percent Of Defacement URLs:{:.2f} %".format(len(data[data['type']=='defacement'])/len(data['type'])*100))
print("Percent Of Phishing URLs:{:.2f} %".format(len(data[data['type']=='phishing'])/len(data['type'])*100))
print("Percent Of Malware URLs:{:.2f} %".format(len(data[data['type']=='malware'])/len(data['type'])*100))
Percent Of Benign URLs:65.74 %
Percent Of Defacement URLs:14.81 %
Percent Of Phishing URLs:14.45 %
Percent Of Malware URLs:4.99 %
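The same breakdown can be computed more concisely with pandas' `value_counts(normalize=True)`; a sketch, with a toy frame standing in for the dataset's `type` column:

```python
import pandas as pd

# Toy stand-in for the dataset's `type` column
data = pd.DataFrame({"type": ["benign"] * 4 + ["phishing"] * 2 + ["malware"]})

# normalize=True returns fractions; multiply by 100 for percentages
percentages = data["type"].value_counts(normalize=True) * 100
print(percentages.round(2))
```

This avoids repeating the boolean-mask pattern once per class and stays correct if a new class is added.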
import plotly.graph_objects as go
labels = ['Defacement','Benign','Phishing','Malware']
values = [ 96457, 428103, 94111, 32520]
# pull is given as a fraction of the pie radius
fig = go.Figure(data=[go.Pie(labels=labels, values=values, pull=[0, 0.1, 0, 0])])
fig.show()
About 65.74% of the URLs in this dataset are safe and the remaining ~34.26% are malicious (defacement, phishing, or malware). There are therefore far more benign (safe) URLs than malicious ones, which is not a bad class balance for this task.
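Since only the benign class is safe, it can help in later analysis to collapse the three attack classes into a single malicious flag; a small sketch (the `is_malicious` column is an illustration, not part of the original feature set):

```python
import pandas as pd

# Toy stand-in for the dataset's `type` column
data = pd.DataFrame({"type": ["benign", "phishing", "defacement", "malware", "benign"]})

# Anything other than benign counts as malicious
data["is_malicious"] = (data["type"] != "benign").astype(int)
print(data["is_malicious"].tolist())  # [0, 1, 1, 1, 0]
```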
#add new column which counts the length of the URL for each row
data["url_length"] = data["url"].str.len()
data.head()
|   | url | type | url_length |
|---|---|---|---|
| 0 | br-icloud.com.br | phishing | 16 |
| 1 | mp3raid.com/music/krizz_kaliko.html | benign | 35 |
| 2 | bopsecrets.org/rexroth/cr/1.htm | benign | 31 |
| 3 | http://www.garage-pirenne.be/index.php?option=... | defacement | 88 |
| 4 | http://adventure-nicaragua.net/index.php?optio... | defacement | 235 |
plt.figure(figsize=(15,5))
plt.hist(data['url_length'],bins=50,color='orchid')
plt.title("URL-Length",fontsize=20)
plt.xlabel("Url Length",fontsize=15)
plt.ylabel("Number Of URLs",fontsize=15)
plt.ylim(0,1000)
In theory, "https" is more secure than "http" because HTTPS encrypts (scrambles) the data before transmission. But in the current data, some phishing and malware URLs also use "https".
An HTTPS site can therefore still be compromised, and the presence of HTTPS does not confirm that a site is legitimate.
Parsing the URLs lets us check whether the protocol is present and gain more insight.
#The URL parsing functions focus on splitting a URL string into its components/combining URL components into URL string
from urllib.parse import urlparse
#count how many times http: and https: appear in each URL
data['number_of_http'] = [x.count('http:') for x in data['url']]
data['number_of_https'] = [x.count('https:') for x in data['url']]
#Parse URLs into components and check the protocol of the URL
def https(o):
https = urlparse(o).scheme
string_https = str(https)
if string_https=='https':
return 1 #https is present
else:
return 0 #https is not present
#adding it to the main dataframe
data['https'] = [https(x) for x in data["url"]]
data.head()
|   | url | type | url_length | number_of_http | number_of_https | https |
|---|---|---|---|---|---|---|
| 0 | br-icloud.com.br | phishing | 16 | 0 | 0 | 0 |
| 1 | mp3raid.com/music/krizz_kaliko.html | benign | 35 | 0 | 0 | 0 |
| 2 | bopsecrets.org/rexroth/cr/1.htm | benign | 31 | 0 | 0 | 0 |
| 3 | http://www.garage-pirenne.be/index.php?option=... | defacement | 88 | 1 | 0 | 0 |
| 4 | http://adventure-nicaragua.net/index.php?optio... | defacement | 235 | 1 | 0 | 0 |
#HTTPs
plt.figure(figsize=(15,5))
plt.title("Number Of HTTPs In URL",fontsize=20)
sns.countplot(x='number_of_https',data=data, palette="hls", hue='type')
plt.xlabel("Number Of https",fontsize=15)
plt.ylabel("Number Of URLs",fontsize=15)
#HTTP
plt.figure(figsize=(15,5))
plt.title("Number Of HTTP In URL",fontsize=20)
sns.countplot(x='number_of_http',data=data, palette="hls", hue='type')
plt.xlabel("Number Of http",fontsize=15)
plt.ylabel("Number Of URLs",fontsize=15)
ax = sns.countplot(x='https', data=data, palette="Set2", hue="type")
ticks = ["No", "Yes"]
_ = ax.set_xticklabels(ticks)
_ = ax.set_title("The presence of HTTPs in the URL")
_ = ax.set_xlabel("https")
_ = ax.set_ylabel("Count")
_ = plt.legend(loc='upper right')
A URL shortening service condenses web addresses. Such an app, also known as a link shortener, redirects the shorter URL to the original.
URL shortening services are also a popular tool for spammers and hackers to compromise a victim's computer: by sharing a shortened link, they can fool users into clicking it.
They can then either install malware on the victim's machine or harvest user credentials through a fake URL. This clearly breaches the user's privacy and security.
#to check any URLs use shortening services
def shortening_service(url):
match = re.search('bit\.ly|goo\.gl|shorte\.st|go2l\.ink|x\.co|ow\.ly|t\.co|tinyurl|tr\.im|is\.gd|cli\.gs|'
'yfrog\.com|migre\.me|ff\.im|tiny\.cc|url4\.eu|twit\.ac|su\.pr|twurl\.nl|snipurl\.com|'
'short\.to|BudURL\.com|ping\.fm|post\.ly|Just\.as|bkite\.com|snipr\.com|fic\.kr|loopt\.us|'
'doiop\.com|short\.ie|kl\.am|wp\.me|rubyurl\.com|om\.ly|to\.ly|bit\.do|t\.co|lnkd\.in|'
'db\.tt|qr\.ae|adf\.ly|goo\.gl|bitly\.com|cur\.lv|tinyurl\.com|ow\.ly|bit\.ly|ity\.im|'
'q\.gs|is\.gd|po\.st|bc\.vc|twitthis\.com|u\.to|j\.mp|buzurl\.com|cutt\.us|u\.bb|yourls\.org|'
'x\.co|prettylinkpro\.com|scrnch\.me|filoops\.info|vzturl\.com|qr\.net|1url\.com|tweez\.me|v\.gd|'
'tr\.im|link\.zip\.net',
url)
if match:
return -1
else:
return 1
data['short_url'] = data['url'].apply(lambda i: shortening_service(i))
plt.figure(figsize=(15,5))
plt.title("Shortening service",fontsize=20)
sns.countplot(x='short_url',data=data, palette="cubehelix", hue='type')
plt.xlabel("short_url",fontsize=15)
plt.ylabel("Number Of URLs",fontsize=15)
For the shortened-URL feature, "1" means the URL does not contain any shortening service, whereas "-1" means it does.
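A quick sanity check of this encoding on hand-picked URLs (using a trimmed pattern list for illustration; the full detector above covers many more services):

```python
import re

# Trimmed pattern list; the full feature matches many more services
SHORTENERS = re.compile(r'bit\.ly|goo\.gl|tinyurl|t\.co|ow\.ly')

def shortening_service(url):
    # -1: a shortening service appears in the URL, 1: none found
    return -1 if SHORTENERS.search(url) else 1

print(shortening_service('http://bit.ly/abc123'))        # -1
print(shortening_service('http://example.com/page.htm'))  # 1
```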
#Parse URLs into components and check the length of the hostname or the domain
data['domain_length'] = [len(urlparse(x).netloc) for x in data['url']]
data.head()
|   | url | type | url_length | number_of_http | number_of_https | https | short_url | domain_length |
|---|---|---|---|---|---|---|---|---|
| 0 | br-icloud.com.br | phishing | 16 | 0 | 0 | 0 | 1 | 0 |
| 1 | mp3raid.com/music/krizz_kaliko.html | benign | 35 | 0 | 0 | 0 | 1 | 0 |
| 2 | bopsecrets.org/rexroth/cr/1.htm | benign | 31 | 0 | 0 | 0 | 1 | 0 |
| 3 | http://www.garage-pirenne.be/index.php?option=... | defacement | 88 | 1 | 0 | 0 | 1 | 21 |
| 4 | http://adventure-nicaragua.net/index.php?optio... | defacement | 235 | 1 | 0 | 0 | 1 | 23 |
#domain length
plt.figure(figsize=(20,5))
plt.hist(data['domain_length'],bins=50,color='salmon')
plt.title("Hostname-Length",fontsize=20)
plt.xlabel("Length Of Hostname",fontsize=18)
plt.ylabel("Number Of Urls",fontsize=18)
plt.ylim(0,1000)
#Count different features
data['number_of_dots'] = [x.count('.') for x in data['url']] # "."
data['number_of_www'] = [x.count('www.') for x in data['url']] #"www."
data['number_of_dot_com'] = [x.count('.com') for x in data['url']] #".com"
data['number_of_index'] = [x.count('index') for x in data['url']] #"index"
#count or explore more characters in the URLs
data['number_of_question_mark'] = [x.count('?') for x in data['url']] #"?"
data['number_of_equal'] = [x.count('=') for x in data['url']] #"="
data['number_of_underscore'] = [x.count('_') for x in data['url']] #"_"
data['number_of_dash'] = [x.count('-') for x in data['url']] #"-"
data['number_of_doubleslash'] = [x.count('//') for x in data['url']] #"//"
data['number_of_backslash'] = [x.count('\\') for x in data['url']] #"\\"
data['number_of_hashtag'] = [x.count('#') for x in data['url']] #"#"
data['number_of_plus'] = [x.count('+') for x in data['url']] #"+"
data['number_of_percentage'] = [x.count('%') for x in data['url']] #"%"
data['number_of_at_sign'] = [x.count('@') for x in data['url']] #"@"
data['number_of_space'] = [x.count(' ') for x in data['url']] #" "
data['number_of_colon'] = [x.count(':') for x in data['url']] #":"
data['number_of_and'] = [x.count('&') for x in data['url']] #"&"
data['number_of_semicolon'] = [x.count(';') for x in data['url']] #";"
data['number_of_exclamation'] = [x.count('!') for x in data['url']] #"!"
data.head(12)
|   | url | type | url_length | number_of_http | number_of_https | https | short_url | domain_length | number_of_dots | number_of_www | ... | number_of_backslash | number_of_hashtag | number_of_plus | number_of_percentage | number_of_at_sign | number_of_space | number_of_colon | number_of_and | number_of_semicolon | number_of_exclamation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | br-icloud.com.br | phishing | 16 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | mp3raid.com/music/krizz_kaliko.html | benign | 35 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | bopsecrets.org/rexroth/cr/1.htm | benign | 31 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | http://www.garage-pirenne.be/index.php?option=... | defacement | 88 | 1 | 0 | 0 | 1 | 21 | 3 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 0 | 0 |
| 4 | http://adventure-nicaragua.net/index.php?optio... | defacement | 235 | 1 | 0 | 0 | 1 | 23 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 | 0 |
| 5 | http://buzzfil.net/m/show-art/ils-etaient-loin... | benign | 118 | 1 | 0 | 0 | 1 | 11 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 6 | espn.go.com/nba/player/_/id/3457/brandon-rush | benign | 45 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | yourbittorrent.com/?q=anthony-hamilton-soulife | benign | 46 | 0 | 0 | 0 | -1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | http://www.pashminaonline.com/pure-pashminas | defacement | 44 | 1 | 0 | 0 | 1 | 22 | 2 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 9 | allmusic.com/album/crazy-from-the-heat-r16990 | benign | 45 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10 | corporationwiki.com/Ohio/Columbus/frank-s-bens... | benign | 62 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11 | http://www.ikenmijnkunst.nl/index.php/expositi... | defacement | 64 | 1 | 0 | 0 | 1 | 20 | 3 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
12 rows × 27 columns
#find the correlation between the features and plot the heat map
plt.figure(figsize=(12, 8))
sns.heatmap(data.corr(), cmap="coolwarm", linewidths=3.0) #linewidths: the gap between cells
data.describe()
|   | url_length | number_of_http | number_of_https | https | short_url | domain_length | number_of_dots | number_of_www | number_of_dot_com | number_of_index | ... | number_of_backslash | number_of_hashtag | number_of_plus | number_of_percentage | number_of_at_sign | number_of_space | number_of_colon | number_of_and | number_of_semicolon | number_of_exclamation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 651191.000000 | 651191.000000 | 651191.000000 | 651191.000000 | 651191.000000 | 651191.000000 | 651191.000000 | 651191.00000 | 651191.000000 | 651191.000000 | ... | 651191.000000 | 651191.000000 | 651191.000000 | 651191.000000 | 651191.000000 | 651191.000000 | 651191.000000 | 651191.000000 | 651191.000000 | 651191.000000 |
| mean | 60.156831 | 0.267279 | 0.024709 | 0.024079 | 0.877901 | 5.023088 | 2.195453 | 0.19099 | 0.663876 | 0.133710 | ... | 0.024819 | 0.000871 | 0.068432 | 0.519502 | 0.002219 | 0.000645 | 0.342509 | 0.380497 | 0.038896 | 0.000954 |
| std | 44.753902 | 0.448633 | 0.156320 | 0.153294 | 0.478843 | 8.911953 | 1.490732 | 0.39676 | 0.504236 | 0.350767 | ... | 0.453892 | 0.032327 | 0.621276 | 4.462254 | 0.054183 | 0.066823 | 0.600128 | 1.224169 | 0.558691 | 0.039546 |
| min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | -1.000000 | 0.000000 | 0.000000 | 0.00000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 32.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 1.000000 | 0.00000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 47.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 2.000000 | 0.00000 | 1.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 77.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 11.000000 | 3.000000 | 0.00000 | 1.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 2175.000000 | 4.000000 | 5.000000 | 1.000000 | 1.000000 | 236.000000 | 42.000000 | 5.00000 | 13.000000 | 6.000000 | ... | 71.000000 | 6.000000 | 37.000000 | 231.000000 | 10.000000 | 43.000000 | 11.000000 | 50.000000 | 104.000000 | 5.000000 |
8 rows × 25 columns
A correlation matrix holds the correlation coefficients between all pairs of variables. A heat map grid can represent these coefficients to build a visual representation of the variables' dependence. This visualization makes it easy to spot the strong dependencies.
A coefficient near +1 indicates a strong positive dependency.
A coefficient near -1 indicates a strong inverse dependency; a coefficient closer to zero indicates weak dependence.
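With the matrix in hand, the strongest off-diagonal dependencies can also be ranked programmatically; a sketch with a toy frame (the same pattern applies to `data.corr()`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the feature dataframe
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # exact multiple of a -> correlation 1.0
    "c": [5, 3, 4, 1, 2],
})
corr = df.corr()

# Keep only the upper triangle so each pair appears once, then rank by |r|
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().abs().sort_values(ascending=False)
print(pairs.head())
```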
corr_matrix = data.corr()
corr_matrix
|   | url_length | number_of_http | number_of_https | https | short_url | domain_length | number_of_dots | number_of_www | number_of_dot_com | number_of_index | ... | number_of_backslash | number_of_hashtag | number_of_plus | number_of_percentage | number_of_at_sign | number_of_space | number_of_colon | number_of_and | number_of_semicolon | number_of_exclamation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| url_length | 1.000000 | 0.362476 | 0.079591 | 0.072201 | -0.014364 | 0.354189 | 0.382767 | 0.159799 | 0.017601 | 0.268490 | ... | 0.210612 | 0.028476 | 0.113847 | 0.304816 | 0.049206 | 0.013410 | 0.400533 | 0.464429 | 0.218538 | 0.023567 |
| number_of_http | 0.362476 | 1.000000 | -0.087403 | -0.091839 | 0.053923 | 0.823687 | 0.223438 | 0.380378 | -0.239566 | 0.389234 | ... | -0.004342 | -0.011494 | 0.001992 | 0.145725 | 0.002387 | -0.004675 | 0.836302 | 0.366185 | -0.012437 | -0.010039 |
| number_of_https | 0.079591 | -0.087403 | 1.000000 | 0.984883 | -0.023212 | 0.217479 | 0.025827 | -0.042365 | 0.016234 | -0.042049 | ... | -0.007150 | 0.046188 | 0.017155 | 0.029774 | 0.089075 | 0.008912 | 0.189624 | -0.004495 | 0.034695 | 0.054813 |
| https | 0.072201 | -0.091839 | 0.984883 | 1.000000 | -0.021328 | 0.218280 | 0.011332 | -0.047738 | 0.012286 | -0.044141 | ... | -0.008589 | 0.047210 | 0.017801 | 0.030007 | 0.089152 | 0.008528 | 0.182440 | -0.010820 | 0.021572 | 0.055994 |
| short_url | -0.014364 | 0.053923 | -0.023212 | -0.021328 | 1.000000 | 0.040704 | -0.030930 | 0.035174 | -0.168322 | 0.032250 | ... | 0.004574 | -0.022298 | 0.003009 | 0.021657 | -0.010036 | -0.006561 | 0.041110 | 0.026106 | -0.011373 | 0.002256 |
| domain_length | 0.354189 | 0.823687 | 0.217479 | 0.218280 | 0.040704 | 1.000000 | 0.285763 | 0.407208 | -0.193097 | 0.367465 | ... | -0.012579 | 0.012712 | -0.006366 | 0.054787 | 0.029241 | -0.000474 | 0.784693 | 0.346305 | -0.004096 | 0.006238 |
| number_of_dots | 0.382767 | 0.223438 | 0.025827 | 0.011332 | -0.030930 | 0.285763 | 1.000000 | 0.359155 | -0.058686 | 0.260992 | ... | 0.088327 | 0.035473 | -0.015390 | -0.046009 | 0.037445 | 0.011653 | 0.221461 | 0.191947 | 0.120025 | -0.002667 |
| number_of_www | 0.159799 | 0.380378 | -0.042365 | -0.047738 | 0.035174 | 0.407208 | 0.359155 | 1.000000 | -0.159643 | 0.365725 | ... | 0.028969 | 0.014692 | -0.047808 | -0.046037 | -0.001427 | 0.003289 | 0.344372 | 0.264036 | 0.022837 | -0.008966 |
| number_of_dot_com | 0.017601 | -0.239566 | 0.016234 | 0.012286 | -0.168322 | -0.193097 | -0.058686 | -0.159643 | 1.000000 | -0.119238 | ... | 0.034793 | 0.013998 | 0.011821 | -0.060208 | 0.041295 | 0.005112 | -0.226680 | -0.090998 | 0.029271 | 0.003060 |
| number_of_index | 0.268490 | 0.389234 | -0.042049 | -0.044141 | 0.032250 | 0.367465 | 0.260992 | 0.365725 | -0.119238 | 1.000000 | ... | 0.044108 | 0.004901 | -0.033080 | -0.031074 | 0.008548 | 0.002021 | 0.420024 | 0.502228 | 0.025038 | -0.007089 |
| number_of_question_mark | 0.416174 | 0.274031 | 0.021671 | 0.013330 | 0.030971 | 0.243867 | 0.272678 | 0.219665 | -0.013904 | 0.456582 | ... | 0.111940 | 0.013363 | 0.032794 | 0.022470 | 0.045904 | 0.007084 | 0.315744 | 0.562101 | 0.153660 | 0.020597 |
| number_of_equal | 0.506073 | 0.378732 | 0.002842 | -0.003854 | 0.030234 | 0.360131 | 0.227160 | 0.275122 | -0.083110 | 0.530800 | ... | 0.060103 | 0.011011 | 0.011490 | -0.000234 | 0.020055 | 0.000930 | 0.496941 | 0.958435 | 0.192007 | 0.013362 |
| number_of_underscore | 0.267169 | 0.019880 | -0.011288 | -0.012642 | 0.010419 | 0.015939 | 0.043121 | -0.006953 | -0.023076 | 0.080947 | ... | 0.036364 | 0.003325 | -0.008172 | -0.006497 | 0.004195 | 0.001455 | 0.045871 | 0.188394 | 0.036850 | 0.010929 |
| number_of_dash | 0.428597 | 0.158087 | -0.005148 | -0.004839 | -0.048113 | 0.094369 | -0.079345 | -0.064314 | 0.062972 | -0.056200 | ... | 0.029538 | -0.001231 | -0.027098 | 0.018230 | 0.013044 | -0.001848 | 0.149401 | -0.005562 | 0.010030 | 0.000238 |
| number_of_doubleslash | 0.377295 | 0.936176 | 0.249241 | 0.244614 | 0.045045 | 0.873297 | 0.222487 | 0.355012 | -0.228924 | 0.361592 | ... | -0.010085 | 0.006087 | 0.007630 | 0.151389 | 0.033286 | -0.000937 | 0.871377 | 0.352514 | -0.003199 | 0.008777 |
| number_of_backslash | 0.210612 | -0.004342 | -0.007150 | -0.008589 | 0.004574 | -0.012579 | 0.088327 | 0.028969 | 0.034793 | 0.044108 | ... | 1.000000 | 0.025425 | 0.008686 | 0.087794 | 0.015182 | 0.009193 | -0.004925 | 0.048104 | 0.115625 | 0.006723 |
| number_of_hashtag | 0.028476 | -0.011494 | 0.046188 | 0.047210 | -0.022298 | 0.012712 | 0.035473 | 0.014692 | 0.013998 | 0.004901 | ... | 0.025425 | 1.000000 | -0.000902 | 0.002176 | 0.110241 | 0.126991 | 0.004812 | 0.011302 | 0.030690 | 0.071424 |
| number_of_plus | 0.113847 | 0.001992 | 0.017155 | 0.017801 | 0.003009 | -0.006366 | -0.015390 | -0.047808 | 0.011821 | -0.033080 | ... | 0.008686 | -0.000902 | 1.000000 | 0.151154 | -0.001409 | 0.000305 | 0.000215 | 0.003318 | -0.002298 | 0.004032 |
| number_of_percentage | 0.304816 | 0.145725 | 0.029774 | 0.030007 | 0.021657 | 0.054787 | -0.046009 | -0.046037 | -0.060208 | -0.031074 | ... | 0.087794 | 0.002176 | 0.151154 | 1.000000 | -0.000506 | -0.000114 | 0.111448 | -0.006304 | 0.008075 | 0.001022 |
| number_of_at_sign | 0.049206 | 0.002387 | 0.089075 | 0.089152 | -0.010036 | 0.029241 | 0.037445 | -0.001427 | 0.041295 | 0.008548 | ... | 0.015182 | 0.110241 | -0.001409 | -0.000506 | 1.000000 | 0.011056 | 0.023900 | 0.017414 | 0.063705 | 0.037713 |
| number_of_space | 0.013410 | -0.004675 | 0.008912 | 0.008528 | -0.006561 | -0.000474 | 0.011653 | 0.003289 | 0.005112 | 0.002021 | ... | 0.009193 | 0.126991 | 0.000305 | -0.000114 | 0.011056 | 1.000000 | 0.000886 | 0.001562 | 0.009077 | 0.029985 |
| number_of_colon | 0.400533 | 0.836302 | 0.189624 | 0.182440 | 0.041110 | 0.784693 | 0.221461 | 0.344372 | -0.226680 | 0.420024 | ... | -0.004925 | 0.004812 | 0.000215 | 0.111448 | 0.023900 | 0.000886 | 1.000000 | 0.498975 | 0.005421 | 0.007525 |
| number_of_and | 0.464429 | 0.366185 | -0.004495 | -0.010820 | 0.026106 | 0.346305 | 0.191947 | 0.264036 | -0.090998 | 0.502228 | ... | 0.048104 | 0.011302 | 0.003318 | -0.006304 | 0.017414 | 0.001562 | 0.498975 | 1.000000 | 0.199884 | 0.015153 |
| number_of_semicolon | 0.218538 | -0.012437 | 0.034695 | 0.021572 | -0.011373 | -0.004096 | 0.120025 | 0.022837 | 0.029271 | 0.025038 | ... | 0.115625 | 0.030690 | -0.002298 | 0.008075 | 0.063705 | 0.009077 | 0.005421 | 0.199884 | 1.000000 | 0.013543 |
| number_of_exclamation | 0.023567 | -0.010039 | 0.054813 | 0.055994 | 0.002256 | 0.006238 | -0.002667 | -0.008966 | 0.003060 | -0.007089 | ... | 0.006723 | 0.071424 | 0.004032 | 0.001022 | 0.037713 | 0.029985 | 0.007525 | 0.015153 | 0.013543 | 1.000000 |
25 rows × 25 columns
#check the minimum and maximum length of the URLs in the dataset for more insights
column = data["url_length"]
maximum_index = column.idxmax()
print("The maximum url_length index is -", maximum_index)
minimum_index = column.idxmin()
print("The minimum url_length index is -", minimum_index)
The maximum url_length index is - 579857
The minimum url_length index is - 573437
data.loc[(maximum_index)]
url                      peekaboopoles.co.uk/holding/payza.com/accounts...
type                                                                benign
url_length                                                            2175
number_of_http                                                           0
number_of_https                                                          0
https                                                                    0
short_url                                                                1
domain_length                                                            0
number_of_dots                                                          13
number_of_www                                                            0
number_of_dot_com                                                        1
number_of_index                                                          0
number_of_question_mark                                                  0
number_of_equal                                                          0
number_of_underscore                                                     0
number_of_dash                                                           0
number_of_doubleslash                                                    0
number_of_backslash                                                      0
number_of_hashtag                                                        0
number_of_plus                                                           0
number_of_percentage                                                     0
number_of_at_sign                                                        0
number_of_space                                                          0
number_of_colon                                                          0
number_of_and                                                            0
number_of_semicolon                                                      0
number_of_exclamation                                                    0
Name: 579857, dtype: object
From these two lookups we can tell that the longest URL in the dataset is benign, whereas the shortest is a phishing URL.
data.loc[(minimum_index)]
url                              
type                     phishing
url_length                      1
number_of_http                  0
number_of_https                 0
https                           0
short_url                       1
domain_length                   0
number_of_dots                  0
number_of_www                   0
number_of_dot_com               0
number_of_index                 0
number_of_question_mark         0
number_of_equal                 0
number_of_underscore            0
number_of_dash                  0
number_of_doubleslash           0
number_of_backslash             0
number_of_hashtag               0
number_of_plus                  0
number_of_percentage            0
number_of_at_sign               0
number_of_space                 0
number_of_colon                 0
number_of_and                   0
number_of_semicolon             0
number_of_exclamation           0
Name: 573437, dtype: object
#The longest URL throughout the entire dataset
print(data['url'][579857])
peekaboopoles.co.uk/holding/payza.com/accounts/underhold.hild.frozen/money.hold/unhold.accounts.code/securty.code.code/ewrlksndafnlkqwkenwkjhjesjnasfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbmewrlksndafnlkqwkenwkjhjesjnasfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbmndsabfmnbdsmnfbndsabfmnbdsmnfdgdfsgfdgdsgfb/ewrlksndafnlkqwkenwkjhjesjnasfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbmdsafsdafdsfasdfsdfsdfasndsabfmnbdsmnfb/ewrlksndafnlkqwkenwkjhjesjnasfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmndsabfmnbdsmnfb/ewrlksndafnlkqwkenwkjhjesjnasfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmndsabfmnbdsmnfb/eewrlksndafnlkqwkenwkjhjesjnasfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmndsabfmnbdsmnfbwrlksndafnlkqwkenwkjhjesjnasfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmn/eewrlksndafnlkqwkenwkjhjesjnasfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmndsabfmnbdsmnfbwrlksndafnlkqwkenwkjhjesjnasfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmnds/eewrlksndafnlkqwkenwkjhjesjnasfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmndsabfmnbdsmnfbwrlksndafnlkqwkenwkjhjesjnasfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmnds/unholding/fmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmn/fmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmnfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmn/fmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmnfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmn/fmkdsbfndsnabfjhsadgrfujhadsjkf
nwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmnfmkdsbfndsnabfjhsadgrfujhadsjkfnwernbmasbmndbsmnbsamnfbdsmnfbsdafsadfadsfsafsdafdsfsadvvmn/hold.login/login.aspx.htm
Plots and graphs are displayed to find out how the data is distributed and how features are related to each other.
#Plotting the data distribution
data.hist(bins=10, figsize=(20,12), log=True)
plt.show()
We shuffle the data so that, when it is split into training and testing sets, both sets reflect the overall class distribution. This reduces the risk of ordering bias during model training.
# shuffle the rows so that the train and test sets share the same class distribution after splitting
data = data.sample(frac=1).reset_index(drop=True)
data.head()
|   | url | type | url_length | number_of_http | number_of_https | https | short_url | domain_length | number_of_dots | number_of_www | ... | number_of_backslash | number_of_hashtag | number_of_plus | number_of_percentage | number_of_at_sign | number_of_space | number_of_colon | number_of_and | number_of_semicolon | number_of_exclamation |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://www.letni-tabor.eu/index.php?view=artic... | defacement | 149 | 1 | 0 | 0 | 1 | 18 | 3 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 8 | 0 | 0 |
| 1 | monette.net/newsite/online/Newsletter2007Fall/... | benign | 53 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | www.cs.yale.edu/homes/tap/Files/hopper-story.html | phishing | 49 | 0 | 0 | 0 | 1 | 0 | 4 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | music.yahoo.com/faun-fables/ | benign | 28 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | ea-download-manager.software.informer.com/ | benign | 42 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
Now that the data wrangling is complete, the data needs to be split.
There are several methods for dividing the dataset into training and test sets. The method adopted in this project uses the identifier of each instance to determine whether or not it belongs in the test set. With this solution we only need to ensure that new data is appended at the end of the dataset and that no existing row is deleted. Its advantage is that the split stays stable if the program is run again, so a separate test set does not have to be saved.
from zlib import crc32 #can compute the checksum for crc32 (Cyclic Redundancy Check) to a particular data
def test_set_check(identifier, test_ratio):
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_testing_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_testing_set], data.loc[in_testing_set]
# Splitting the dataset into train and test sets: 80-20 split
data_with_id = data.reset_index() #add an `index` column to the dataset
training_set, testing_set = split_train_test_by_id(data_with_id, 0.2, "index")
#a test_ratio of 0.2 puts 20% of the instances in the test set
Now that we have the training and test sets, we need to separate each into data and labels. Our "y" value is the label, which contains only the type column, because that is what we want to predict. The data consists of our "x" values, which are all of the remaining attributes.
x_train = training_set.drop(columns=["url","type","index"])
y_train = training_set["type"].copy()
x_test = testing_set.drop(columns=["url","type","index"])
y_test = testing_set["type"].copy()
print('The training dataset shape:', x_train.shape, y_train.shape)
print('The testing dataset shape:', x_test.shape, y_test.shape)
The training dataset shape: (520948, 25) (520948,)
The testing dataset shape: (130243, 25) (130243,)
print(len(training_set), "train +", len(testing_set), "test")
520948 train + 130243 test
testing_set["type"].value_counts() / len(testing_set)
benign        0.657579
defacement    0.147985
phishing      0.144261
malware       0.050175
Name: type, dtype: float64
data["type"].value_counts() / len(data)
benign        0.657415
defacement    0.148124
phishing      0.144521
malware       0.049939
Name: type, dtype: float64
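The closeness of the two distributions can be quantified. Using the proportions printed above, the largest per-class deviation between the test split and the full dataset is tiny:

```python
import pandas as pd

# class proportions copied from the two outputs above
full = pd.Series({"benign": 0.657415, "defacement": 0.148124,
                  "phishing": 0.144521, "malware": 0.049939})
test = pd.Series({"benign": 0.657579, "defacement": 0.147985,
                  "phishing": 0.144261, "malware": 0.050175})

# the hash-based split approximates a stratified split quite closely
max_gap = (test - full).abs().max()
print(f"largest class-proportion gap: {max_gap:.6f}")  # 0.000260
```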
As with all transformations, the scalers must be fitted on the training data only, never on the entire dataset (which would include the test set). The training features, all numerical here, are then standardized inside a single pipeline so the same steps can be reapplied consistently later.
#To assemble several steps that can be cross-validated together while setting different parameters
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
number_pipeline = Pipeline([
('std_scaler', StandardScaler()),])
url_transform_num_tr = number_pipeline.fit_transform(x_train)
*StandardScaler*
The standard score of a sample x is calculated as z = (x - u) / s, where "u" is the mean of the training samples (or zero if with_mean=False) and "s" is the standard deviation of the training samples (or one if with_std=False).
Centering and scaling happen independently on each feature, using statistics computed on the training set. The mean and standard deviation are then stored and applied to later data via transform.
Standardizing a dataset is a common requirement for many machine learning estimators: they can behave badly if the individual features do not look more or less like standard normally distributed data.
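A tiny numeric check (toy data) shows the formula in action: standardizing by hand with the population mean and standard deviation matches StandardScaler's output.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0]])
scaler = StandardScaler().fit(X)  # learns u (mean) and s (std) from the data

u, s = X.mean(), X.std()  # StandardScaler uses the population standard deviation
manual = (X - u) / s      # z = (x - u) / s computed by hand

assert np.allclose(scaler.transform(X), manual)
print(scaler.mean_, scaler.scale_)  # [2.5] [1.11803399]
```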
url_transform_num_tr
array([[ 1.98055635, 1.63248411, -0.15821385, ..., 6.2378656 ,
-0.07378274, -0.02406817],
[-0.15933232, -0.59603705, -0.15821385, ..., -0.31081032,
-0.07378274, -0.02406817],
[-0.71659499, -0.59603705, -0.15821385, ..., -0.31081032,
-0.07378274, -0.02406817],
...,
[ 1.93597534, -0.59603705, -0.15821385, ..., 0.50777417,
1.86268626, -0.02406817],
[-0.22620384, -0.59603705, -0.15821385, ..., -0.31081032,
-0.07378274, -0.02406817],
[-0.22620384, -0.59603705, -0.15821385, ..., -0.31081032,
-0.07378274, -0.02406817]])
#This estimator allows different columns or column subsets of the input to be transformed separately
#the features generated by each transformer are concatenated to form a single feature space
from sklearn.compose import ColumnTransformer
number_attributes = list(x_train)
full_pipeline = ColumnTransformer([
("number", number_pipeline, number_attributes)])
x_train = full_pipeline.fit_transform(training_set)
x_train
array([[ 1.98055635, 1.63248411, -0.15821385, ..., 6.2378656 ,
-0.07378274, -0.02406817],
[-0.15933232, -0.59603705, -0.15821385, ..., -0.31081032,
-0.07378274, -0.02406817],
[-0.71659499, -0.59603705, -0.15821385, ..., -0.31081032,
-0.07378274, -0.02406817],
...,
[ 1.93597534, -0.59603705, -0.15821385, ..., 0.50777417,
1.86268626, -0.02406817],
[-0.22620384, -0.59603705, -0.15821385, ..., -0.31081032,
-0.07378274, -0.02406817],
[-0.22620384, -0.59603705, -0.15821385, ..., -0.31081032,
-0.07378274, -0.02406817]])
x_train.shape
(520948, 25)
The full pipeline must also be applied to the test set, but using transform() only, relying on the fit performed on the training set. fit_transform() must never be called on the test set.
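A minimal sketch (toy numbers) of why this matters: calling fit_transform() on the test set recomputes the statistics from the test data itself, which both leaks information and distorts the values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[0.0], [10.0]])  # fitting yields mean 5.0, std 5.0
test = np.array([[7.5]])

scaler = StandardScaler().fit(train)
correct = scaler.transform(test)              # (7.5 - 5) / 5 = 0.5
leaky = StandardScaler().fit_transform(test)  # WRONG: refits on the test point itself

assert np.isclose(correct[0, 0], 0.5)
assert np.isclose(leaky[0, 0], 0.0)  # the leaky version calls the point "average"
```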
x_test = full_pipeline.transform(testing_set)
x_test.shape
(130243, 25)
# check to see if both of the datasets contain the same number of columns
x_train.shape
(520948, 25)
x_test.shape
(130243, 25)
This is a classification problem; the classification machine learning models listed below will be trained and evaluated to see which outperforms the rest.
#by creating holders to store the model performance results
ML_Model = []
acc_train = []
acc_test = []
#function to call for storing the results
def storeResults(model, a, b):
    ML_Model.append(model)
    acc_train.append(round(a, 3))
    acc_test.append(round(b, 3))
#from sklearn.linear_model import LogisticRegression
logisticReg = LogisticRegression(random_state=13).fit(x_train, y_train)
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict, cross_val_score #cross-validation
from statistics import mean
import time #time taken for the model to train
custom_scorer = {'accuracy': make_scorer(accuracy_score),
'precision': make_scorer(precision_score, average='macro'),
'recall': make_scorer(recall_score, average='macro')
}
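A toy illustration of why average='macro' is used here: macro averaging weights every class equally, so a classifier that ignores a minority class is penalized, while weighted averaging can hide the failure.

```python
from sklearn.metrics import precision_score

# imbalanced toy labels: the classifier only ever predicts the majority class
y_true = ["benign"] * 8 + ["malware"] * 2
y_pred = ["benign"] * 10

macro = precision_score(y_true, y_pred, average="macro", zero_division=0)
weighted = precision_score(y_true, y_pred, average="weighted", zero_division=0)

# macro averages the two per-class precisions (0.8 and 0.0) equally,
# so the ignored minority class drags the score down to 0.4;
# weighted averaging reports 0.64 and largely hides the failure
assert abs(macro - 0.4) < 1e-6 and abs(weighted - 0.64) < 1e-6
```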
from sklearn.model_selection import cross_validate #needed as a direct import; `sklearn` itself is never bound
start = time.time()
scores_logisticReg = cross_validate(
    logisticReg, x_train, y_train,
    cv = 5, scoring = custom_scorer, n_jobs=-1) #run the folds in parallel so that all CPUs get used
#reduce computation time
for name in scores_logisticReg.keys(): #to retrieve all the scores
    average_logisticReg = np.average(scores_logisticReg[name])
    print('%s: %.5f' %(name,average_logisticReg))
stop = time.time()
training_time_logistic = stop - start
print(f"Training_time: {training_time_logistic} seconds")
fit_time: 28.94565
score_time: 0.96055
test_accuracy: 0.86710
test_precision: 0.79524
test_recall: 0.74829
Training_time: 32.97947597503662 seconds
#the confusion matrix
#the comparison graphs
accuracy_logisticReg = []
precision_logisticReg = []
recall_logisticReg = []
for k, v in scores_logisticReg.items():
    if k == 'test_accuracy':
        accuracy_logisticReg.append(v.mean())
    if k == 'test_precision':
        precision_logisticReg.append(v.mean())
    if k == 'test_recall':
        recall_logisticReg.append(v.mean())
print('Accuracy of the training set: {}'.format(accuracy_logisticReg))
print('Precision of the training set: {}'.format(precision_logisticReg))
print('Recall of the training set: {}'.format(recall_logisticReg))
Accuracy of the training set: [0.8670999752602654]
Precision of the training set: [0.7952409134979124]
Recall of the training set: [0.7482943263940587]
#predicting on the testing set
#import classification_report,confusion_matrix
#import precision_recall_fscore_support as score
#classification report which include all the scores
predictions_logisticReg = logisticReg.predict(x_test)
print(classification_report(y_test,predictions_logisticReg))
##extract precision and recall scores
predict_precision_logisticReg,predict_recall_logisticReg,fscore,support=score(y_test,predictions_logisticReg,average='macro')
print('Precision : {:.3f}'.format(predict_precision_logisticReg)) #:.3f - 3 decimal place of the value
print('Recall : {:.3f}'.format(predict_recall_logisticReg))
precision recall f1-score support
benign 0.89 0.97 0.93 85645
defacement 0.87 0.92 0.90 19274
malware 0.69 0.69 0.69 6535
phishing 0.72 0.40 0.52 18789
accuracy 0.87 130243
macro avg 0.80 0.75 0.76 130243
weighted avg 0.86 0.87 0.85 130243
Precision : 0.795
Recall : 0.745
A confusion matrix is generated to better visualize how the basic Logistic Regression model performed on the test set.
#cm1 - confusion matrix 1
cm1 = confusion_matrix(y_test, predictions_logisticReg)
print(cm1)
#the diagonal elements are the number of points for which the predicted label equals the true label
#the off-diagonal elements are those for which the classifier mislabeled
predict_accuracy_logisticReg = sum(np.diag(cm1))/sum(sum(cm1))*100
print('{:.3f}% accurately classified after prediction on the testing set'.format(predict_accuracy_logisticReg))
[[82978   353   502  1812]
 [  581 17816   492   385]
 [  532   788  4477   738]
 [ 8779  1480   971  7559]]
86.630% accurately classified after prediction on the testing set
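The diagonal-sum identity used above can be verified on toy labels: the trace of the confusion matrix divided by the total count equals accuracy_score.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = ["benign", "benign", "phishing", "malware", "phishing"]
y_pred = ["benign", "phishing", "phishing", "malware", "benign"]

cm = confusion_matrix(y_true, y_pred)
# correct predictions sit on the diagonal, so trace / total = accuracy
diag_accuracy = np.trace(cm) / cm.sum()
assert np.isclose(diag_accuracy, accuracy_score(y_true, y_pred))  # both 0.6
```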
logisticReg.classes_
array(['benign', 'defacement', 'malware', 'phishing'], dtype=object)
#predicting the target value from the model for the samples
y_test_logisticReg = logisticReg.predict(x_test)
y_train_logisticReg = logisticReg.predict(x_train)
#computing the accuracy of the model performance
acc_train_logisticReg = accuracy_score(y_train,y_train_logisticReg)
acc_test_logisticReg = accuracy_score(y_test,y_test_logisticReg)
print("Logistic Regression: Accuracy on training Data: {:.3f}".format(acc_train_logisticReg))
print("Logistic Regression: Accuracy on testing Data: {:.3f}".format(acc_test_logisticReg))
Logistic Regression: Accuracy on training Data: 0.867
Logistic Regression: Accuracy on testing Data: 0.866
from sklearn import metrics
cm_logisticReg = metrics.confusion_matrix(y_test, predictions_logisticReg)
plt.figure(figsize=(10,10))
sns.heatmap(cm_logisticReg, annot=True, fmt="d", linewidths=.5,
    square = True, cmap = 'PRGn'); #fmt="d" shows the counts as integers
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
title_logisticReg = 'Accuracy Score: {:.3f}%'.format(predict_accuracy_logisticReg)
plt.title(title_logisticReg, size = 20);
To make the labels easier to read, a second confusion matrix is plotted with the class names displayed.
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
cmd1 = ConfusionMatrixDisplay(cm_logisticReg,
display_labels=['benign', 'defacement','malware','phishing'])
cmd1.plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2223f8d6d30>
#storing the results
#execute only once to avoid duplications.
storeResults('Logistic Regression', acc_train_logisticReg, acc_test_logisticReg)
With the Logistic Regression model, the cross-validation accuracy on the training set is 86.710%, and the model took about 33 seconds to run. After predicting on the testing set, 86.630% of the instances were correctly classified.
# Decision Tree model
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
start = time.time()
decisionTree = DecisionTreeClassifier(random_state=13)
decisionTree.fit(x_train,y_train)
scores_decisionTree = cross_validate(
    decisionTree, x_train, y_train,
    cv = 5, scoring = custom_scorer)
for name in scores_decisionTree.keys():
    average_decisionTree = np.average(scores_decisionTree[name])
    print('%s: %.5f' %(name,average_decisionTree))
stop = time.time()
training_time_decision = stop - start
print(f"Training_time: {training_time_decision} seconds")
fit_time: 2.10654
score_time: 0.89889
test_accuracy: 0.91816
test_precision: 0.91173
test_recall: 0.86579
Training_time: 18.339083909988403 seconds
#This is to be used for the confusion matrix and comparison graphs
accuracy_decisionTree = []
precision_decisionTree = []
recall_decisionTree = []
for k, v in scores_decisionTree.items():
    if k == 'test_accuracy':
        accuracy_decisionTree.append(v.mean())
    if k == 'test_precision':
        precision_decisionTree.append(v.mean())
    if k == 'test_recall':
        recall_decisionTree.append(v.mean())
print('Accuracy on training set: {}'.format(accuracy_decisionTree))
print('Precision on training set: {}'.format(precision_decisionTree))
print('Recall on training set: {}'.format(recall_decisionTree))
Accuracy on training set: [0.9181588157760997]
Precision on training set: [0.9117281399900303]
Recall on training set: [0.8657892332100353]
The trained Decision Tree is then used to predict on the test set.
#classification report
predictions_decisionTree = decisionTree.predict(x_test)
print(classification_report(y_test,predictions_decisionTree))
##extract precision and recall scores
predict_precision_decisionTree,predict_recall_decisionTree,fscore,support=score(y_test,predictions_decisionTree,average='macro')
print('Precision : {:.5f}'.format(predict_precision_decisionTree))
print('Recall : {:.5f}'.format(predict_recall_decisionTree))
precision recall f1-score support
benign 0.92 0.98 0.95 85645
defacement 0.96 0.97 0.97 19274
malware 0.95 0.93 0.94 6535
phishing 0.83 0.59 0.69 18789
accuracy 0.92 130243
macro avg 0.92 0.87 0.89 130243
weighted avg 0.92 0.92 0.91 130243
Precision : 0.91503
Recall : 0.86670
#cm2 - confusion matrix 2
cm2 = confusion_matrix(y_test, predictions_decisionTree)
print(cm2)
predict_accuracy_decisionTree = sum(np.diag(cm2))/sum(sum(cm2))*100
print('{:.3f}% accurately classified after prediction on the testing set'.format(predict_accuracy_decisionTree))
[[83837    57    42  1709]
 [   74 18732    70   398]
 [  259    71  6051   154]
 [ 6830   654   218 11087]]
91.911% accurately classified after prediction on the testing set
#predicting the target value from the model for the samples
y_test_decisionTree = decisionTree.predict(x_test)
y_train_decisionTree = decisionTree.predict(x_train)
#computing the accuracy of the model performance
acc_train_decisionTree = accuracy_score(y_train,y_train_decisionTree)
acc_test_decisionTree = accuracy_score(y_test,y_test_decisionTree)
print("Decision Tree: Accuracy on training Data: {:.3f}".format(acc_train_decisionTree))
print("Decision Tree: Accuracy on testing Data: {:.3f}".format(acc_test_decisionTree))
Decision Tree: Accuracy on training Data: 0.937
Decision Tree: Accuracy on testing Data: 0.919
cm_decisionTree = metrics.confusion_matrix(y_test, predictions_decisionTree)
plt.figure(figsize=(10,10))
sns.heatmap(cm_decisionTree, annot=True, fmt="d", linewidths=.5,
    square = True, cmap = 'PuRd_r'); #fmt="d" shows the counts as integers
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
decisionTree_title = 'Accuracy Score: {:.3f}%'.format(predict_accuracy_decisionTree)
plt.title(decisionTree_title, size = 20);
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('Decision Tree', acc_train_decisionTree, acc_test_decisionTree)
With the Decision Tree classifier, the cross-validation accuracy on the training set is 91.816%, and the model took about 18 seconds to run. After predicting on the testing set, 91.911% of the instances were correctly classified.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
start = time.time()
RFClassifier = RandomForestClassifier(random_state=13)
RFClassifier.fit(x_train,y_train)
scores_RFClassifier = cross_validate(
    RFClassifier, x_train, y_train,
    cv = 5, scoring = custom_scorer, n_jobs=-1)
for name in scores_RFClassifier.keys():
    average_RFClassifier = np.average(scores_RFClassifier[name])
    print('%s: %.5f' %(name,average_RFClassifier))
stop = time.time()
training_time_rf = stop - start
print(f"Training_time: {training_time_rf} seconds")
fit_time: 81.31676
score_time: 2.93961
test_accuracy: 0.92240
test_precision: 0.92281
test_recall: 0.87187
Training_time: 138.48379015922546 seconds
#This is to be used for the confusion matrix and comparison graphs
accuracy_RFClassifier = []
precision_RFClassifier = []
recall_RFClassifier = []
for k, v in scores_RFClassifier.items():
    if k == 'test_accuracy':
        accuracy_RFClassifier.append(v.mean())
    if k == 'test_precision':
        precision_RFClassifier.append(v.mean())
    if k == 'test_recall':
        recall_RFClassifier.append(v.mean())
accuracy_RFClassifier
[0.9223972419587643]
precision_RFClassifier
[0.9228057714838259]
recall_RFClassifier
[0.8718725129664431]
#classification report
predictions_RFClassifier = RFClassifier.predict(x_test)
print(classification_report(y_test,predictions_RFClassifier))
#extract precision and recall scores
predict_precision_RFClassifier,predict_recall_RFClassifier,fscore,support=score(y_test,predictions_RFClassifier,average='macro')
print('Precision : {:.3f}'.format(predict_precision_RFClassifier))
print('Recall : {:.3f}'.format(predict_recall_RFClassifier))
precision recall f1-score support
benign 0.92 0.98 0.95 85645
defacement 0.96 0.98 0.97 19274
malware 0.96 0.93 0.95 6535
phishing 0.85 0.60 0.71 18789
accuracy 0.92 130243
macro avg 0.92 0.87 0.89 130243
weighted avg 0.92 0.92 0.92 130243
Precision : 0.924
Recall : 0.872
#cm3 - confusion matrix 3
cm3 = confusion_matrix(y_test, predictions_RFClassifier)
print(cm3)
predict_accuracy_RFClassifier = sum(np.diag(cm3))/sum(sum(cm3))*100
print('{:.3f}% accurately classified after prediction on the testing set'.format(predict_accuracy_RFClassifier))
[[83953    30    31  1631]
 [   74 18850    63   287]
 [  260    69  6053   153]
 [ 6705   600   126 11358]]
92.300% accurately classified after prediction on the testing set
#predicting the target value from the model for the samples
y_test_RFClassifier = RFClassifier.predict(x_test)
y_train_RFClassifier = RFClassifier.predict(x_train)
#computing the accuracy of the model performance
acc_train_RFClassifier = accuracy_score(y_train,y_train_RFClassifier)
acc_test_RFClassifier = accuracy_score(y_test,y_test_RFClassifier)
print("Random Forest: Accuracy on training Data: {:.3f}".format(acc_train_RFClassifier))
print("Random Forest: Accuracy on testing Data: {:.3f}".format(acc_test_RFClassifier))
Random Forest: Accuracy on training Data: 0.937
Random Forest: Accuracy on testing Data: 0.923
cm_RFClassifier = metrics.confusion_matrix(y_test, predictions_RFClassifier)
plt.figure(figsize=(10,10))
sns.heatmap(cm_RFClassifier, annot=True, fmt="d", linewidths=.5,
    square = True, cmap = 'RdPu_r'); #fmt="d" shows the counts as integers
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
title_RFClassifier = 'Accuracy Score: {:.3f}%'.format(predict_accuracy_RFClassifier)
plt.title(title_RFClassifier, size = 20);
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('Random Forest', acc_train_RFClassifier, acc_test_RFClassifier)
With the Random Forest, the cross-validation accuracy on the training set is 92.240%, and the model took about 138 seconds to run. After predicting on the testing set, 92.300% of the instances were correctly classified.
RFClassifier.feature_importances_
array([0.15310116, 0.05850392, 0.0088361 , 0.00785813, 0.00428436,
0.18284737, 0.06210948, 0.14899046, 0.02698251, 0.05379551,
0.01176025, 0.02034037, 0.01597618, 0.04571767, 0.07297963,
0.00106363, 0.00044716, 0.00247136, 0.01304632, 0.00101905,
0.00019046, 0.07918173, 0.02176916, 0.00636269, 0.00036535])
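The raw importance array is hard to read on its own. A sketch of pairing each importance with its column name and sorting, shown here on synthetic stand-in features with illustrative names (the full pipeline is not reproduced):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in for the URL features; the column names are illustrative
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=13)
names = ["url_length", "number_of_dots", "https", "domain_length", "short_url"]

forest = RandomForestClassifier(random_state=13).fit(X, y)

# pair each importance with its column name and sort, most important first
ranked = sorted(zip(names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.4f}")
```

The same `sorted(zip(number_attributes, RFClassifier.feature_importances_), ...)` pattern applies directly to the fitted model above.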
# Multilayer Perceptrons model
#from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_validate
start = time.time()
MultiClassifier = MLPClassifier()
MultiClassifier.fit(x_train,y_train)
scores_MultiClassifier = cross_validate(
    MultiClassifier, x_train, y_train,
    cv = 5, scoring = custom_scorer, n_jobs=-1)
for name in scores_MultiClassifier.keys():
    average_MultiClassifier = np.average(scores_MultiClassifier[name])
    print('%s: %.5f' %(name,average_MultiClassifier))
stop = time.time()
training_time_MLPC = stop - start
print(f"Training_time: {training_time_MLPC} seconds")
fit_time: 204.75199
score_time: 0.89120
test_accuracy: 0.91215
test_precision: 0.90384
test_recall: 0.84324
Training_time: 556.1540832519531 seconds
#This is to be used for the confusion matrix and comparison graphs
accuracy_MLPClassifier = []
precision_MLPClassifier = []
recall_MLPClassifier = []
for k, v in scores_MultiClassifier.items():
    if k == 'test_accuracy':
        accuracy_MLPClassifier.append(v.mean())
    if k == 'test_precision':
        precision_MLPClassifier.append(v.mean())
    if k == 'test_recall':
        recall_MLPClassifier.append(v.mean())
accuracy_MLPClassifier
[0.9121524552910115]
precision_MLPClassifier
[0.9038412439771903]
recall_MLPClassifier
[0.8432389047886065]
#classification report
predictions_MultiClassifier = MultiClassifier.predict(x_test)
print(classification_report(y_test,predictions_MultiClassifier))
#extract precision and recall scores
predict_precision_MultiClassifier,predict_recall_MultiClassifier,fscore,support=score(y_test,predictions_MultiClassifier,average='macro')
print('Precision : {:.5f}'.format(predict_precision_MultiClassifier))
print('Recall : {:.5f}'.format(predict_recall_MultiClassifier))
precision recall f1-score support
benign 0.92 0.98 0.95 85645
defacement 0.94 0.97 0.95 19274
malware 0.94 0.87 0.90 6535
phishing 0.83 0.57 0.68 18789
accuracy 0.91 130243
macro avg 0.91 0.85 0.87 130243
weighted avg 0.91 0.91 0.91 130243
Precision : 0.90707
Recall : 0.84783
#cm4 - confusion matrix 4
cm4 = confusion_matrix(y_test, predictions_MultiClassifier)
print(cm4)
predict_accuracy_MultiClassifier = sum(np.diag(cm4))/sum(sum(cm4))*100
print('{:.3f}% accurately classified after prediction on the testing set'.format(predict_accuracy_MultiClassifier))
[[83946    38    48  1613]
 [  157 18752   111   254]
 [  321   279  5691   244]
 [ 6976   932   220 10661]]
91.406% accurately classified after prediction on the testing set
#predicting the target value from the model for the samples
y_test_MultiClassifier = MultiClassifier.predict(x_test)
y_train_MultiClassifier = MultiClassifier.predict(x_train)
#computing the accuracy of the model performance
acc_train_MultiClassifier = accuracy_score(y_train,y_train_MultiClassifier)
acc_test_MultiClassifier = accuracy_score(y_test,y_test_MultiClassifier)
print("Multi-Layer Perceptrons: Accuracy on training Data: {:.3f}".format(acc_train_MultiClassifier))
print("Multi-Layer Perceptrons: Accuracy on testing Data: {:.3f}".format(acc_test_MultiClassifier))
Multi-Layer Perceptrons: Accuracy on training Data: 0.916
Multi-Layer Perceptrons: Accuracy on testing Data: 0.914
cm_MultiClassifier = metrics.confusion_matrix(y_test, predictions_MultiClassifier)
plt.figure(figsize=(10,10))
sns.heatmap(cm_MultiClassifier, annot=True, fmt="d", linewidths=.5,
    square = True, cmap = 'hot'); #fmt="d" shows the counts as integers
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
title_MultiClassifier = 'Accuracy Score: {:.3f}%'.format(predict_accuracy_MultiClassifier)
plt.title(title_MultiClassifier, size = 20);
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('Multilayer Perceptrons', acc_train_MultiClassifier, acc_test_MultiClassifier)
With the Multilayer Perceptron, the cross-validation accuracy on the training set is 91.215%, and the model took about 556 seconds to run. After predicting on the testing set, 91.406% of the instances were correctly classified.
# GradientBoostingClassifier
#from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_validate
start = time.time()
GBClassifier = GradientBoostingClassifier()
GBClassifier.fit(x_train,y_train)
scores_GBClassifier = cross_validate(
    GBClassifier, x_train, y_train,
    cv = 5, scoring = custom_scorer, n_jobs=-1)
for name in scores_GBClassifier.keys():
    average_GBClassifier = np.average(scores_GBClassifier[name])
    print('%s: %.5f' %(name,average_GBClassifier))
stop = time.time()
training_time_GB = stop - start
print(f"Training_time: {training_time_GB} seconds")
fit_time: 718.59068
score_time: 1.82676
test_accuracy: 0.89883
test_precision: 0.88374
test_recall: 0.80707
Training_time: 945.658388376236 seconds
#confusion matrix
#comparison graphs
accuracy_GradientBoostingClassifier = []
precision_GradientBoostingClassifier = []
recall_GradientBoostingClassifier = []
for k, v in scores_GBClassifier.items():
    if k == 'test_accuracy':
        accuracy_GradientBoostingClassifier.append(v.mean())
    if k == 'test_precision':
        precision_GradientBoostingClassifier.append(v.mean())
    if k == 'test_recall':
        recall_GradientBoostingClassifier.append(v.mean())
accuracy_GradientBoostingClassifier
[0.8988344286074037]
precision_GradientBoostingClassifier
[0.8837379412827033]
recall_GradientBoostingClassifier
[0.8070690974449294]
#classification report
predictions_GBClassifier = GBClassifier.predict(x_test)
print(classification_report(y_test,predictions_GBClassifier))
#extract precision and recall scores
predict_precision_GBClassifier,predict_recall_GBClassifier,fscore,support=score(y_test,predictions_GBClassifier,average='macro')
print('Precision : {:.5f}'.format(predict_precision_GBClassifier))
print('Recall : {:.5f}'.format(predict_recall_GBClassifier))
precision recall f1-score support
benign 0.91 0.98 0.94 85645
defacement 0.90 0.96 0.93 19274
malware 0.90 0.79 0.84 6535
phishing 0.83 0.48 0.60 18789
accuracy 0.90 130243
macro avg 0.88 0.80 0.83 130243
weighted avg 0.89 0.90 0.89 130243
Precision : 0.88353
Recall : 0.80299
#cm5 - confusion matrix 5
cm5 = confusion_matrix(y_test, predictions_GBClassifier)
print(cm5)
predict_accuracy_GBClassifier = sum(np.diag(cm5))/sum(sum(cm5))*100
print('{:.3f}% accurately classified after prediction on the testing set'.format(predict_accuracy_GBClassifier))
[[84304    54    49  1238]
 [  227 18589   170   288]
 [  394   632  5144   365]
 [ 8192  1299   354  8944]]
89.817% accurately classified after prediction on the testing set
#predicting the target value from the model for the samples
y_test_GBClassifier = GBClassifier.predict(x_test)
y_train_GBClassifier = GBClassifier.predict(x_train)
#computing the accuracy of the model performance
acc_train_GBClassifier = accuracy_score(y_train,y_train_GBClassifier)
acc_test_GBClassifier = accuracy_score(y_test,y_test_GBClassifier)
print("Gradient Boosting Classifier: Accuracy on training Data: {:.3f}".format(acc_train_GBClassifier))
print("Gradient Boosting Classifier: Accuracy on testing Data: {:.3f}".format(acc_test_GBClassifier))
Gradient Boosting Classifier: Accuracy on training Data: 0.899
Gradient Boosting Classifier: Accuracy on testing Data: 0.898
cm_GBClassifier = metrics.confusion_matrix(y_test, predictions_GBClassifier)
plt.figure(figsize=(10,10))
sns.heatmap(cm_GBClassifier, annot=True, fmt="d", linewidths=.5,
    square = True, cmap = 'YlGn_r'); #fmt="d" shows the counts as integers
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
title_GBClassifier = 'Accuracy Score: {:.3f}%'.format(predict_accuracy_GBClassifier)
plt.title(title_GBClassifier, size = 20);
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('GradientBoostingClassifier', acc_train_GBClassifier, acc_test_GBClassifier)
With the Gradient Boosting classifier, the cross-validation accuracy on the training set is 89.883%, and the model took about 946 seconds to run. After predicting on the testing set, 89.817% of the instances were correctly classified.
#from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_validate
start = time.time()
NBClassifier = GaussianNB()
NBClassifier.fit(x_train,y_train)
scores_NBClassifier = cross_validate(
    NBClassifier, x_train, y_train,
    cv = 5, scoring = custom_scorer, n_jobs=-1)
for name in scores_NBClassifier.keys():
    average_NBClassifier = np.average(scores_NBClassifier[name])
    print('%s: %.5f' %(name,average_NBClassifier))
stop = time.time()
training_time_NB = stop - start
print(f"Training_time: {training_time_NB} seconds")
fit_time: 1.21066
score_time: 1.21176
test_accuracy: 0.69875
test_precision: 0.62619
test_recall: 0.59812
Training_time: 5.953402757644653 seconds
#This is to be used for the confusion matrix and comparison graphs
accuracy_GaussianNB = []
precision_GaussianNB = []
recall_GaussianNB = []
for k, v in scores_NBClassifier.items():
    if k == 'test_accuracy':
        accuracy_GaussianNB.append(v.mean())
    if k == 'test_precision':
        precision_GaussianNB.append(v.mean())
    if k == 'test_recall':
        recall_GaussianNB.append(v.mean())
accuracy_GaussianNB
[0.6987476363662506]
precision_GaussianNB
[0.6261854888885325]
recall_GaussianNB
[0.5981215502904451]
#classification report
predictions_NBClassifier = NBClassifier.predict(x_test)
print(classification_report(y_test,predictions_NBClassifier))
#extract precision and recall scores
predict_precision_NBClassifier,predict_recall_NBClassifier,fscore,support=score(y_test,predictions_NBClassifier,average='macro')
print('Precision : {:.3f}'.format(predict_precision_NBClassifier))
print('Recall : {:.3f}'.format(predict_recall_NBClassifier))
precision recall f1-score support
benign 0.90 0.88 0.89 85645
defacement 0.61 1.00 0.75 19274
malware 0.35 0.33 0.34 6535
phishing 0.71 0.34 0.46 18789
accuracy 0.79 130243
macro avg 0.64 0.64 0.61 130243
weighted avg 0.80 0.79 0.78 130243
Precision : 0.641
Recall : 0.636
#cm - confusion matrix 6
cm6 = confusion_matrix(y_test, predictions_NBClassifier)
print(cm6)
predict_accuracy_NBClassifier = sum(np.diag(cm6))/sum(sum(cm6))*100
print('{:.3f}% accurately classified after prediction on the testing set'.format(predict_accuracy_NBClassifier))
[[74948  5455  2620  2622]
 [   34 19224    14     2]
 [  267  4054  2160    54]
 [ 8046  2973  1339  6431]]
78.901% accurately classified after prediction on the testing set
#predicting the target value from the model for the samples
y_test_NBClassifier = NBClassifier.predict(x_test)
y_train_NBClassifier = NBClassifier.predict(x_train)
#computing the accuracy of the model performance
acc_train_NBClassifier = accuracy_score(y_train,y_train_NBClassifier)
acc_test_NBClassifier = accuracy_score(y_test,y_test_NBClassifier)
print("Naive Bayes: Accuracy on training Data: {:.3f}".format(acc_train_NBClassifier))
print("Naive Bayes: Accuracy on testing Data: {:.3f}".format(acc_test_NBClassifier))
Naive Bayes: Accuracy on training Data: 0.790
Naive Bayes: Accuracy on testing Data: 0.789
cm_NBClassifier = metrics.confusion_matrix(y_test, predictions_NBClassifier)
plt.figure(figsize=(10,10))
sns.heatmap(cm_NBClassifier, annot=True, fmt="d", linewidths=.5,
    square = True, cmap = 'YlGnBu_r'); #fmt="d" shows the counts as integers
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
title_NBClassifier = 'Accuracy Score: {:.3f}%'.format(predict_accuracy_NBClassifier)
plt.title(title_NBClassifier, size = 20);
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('Naive Bayes', acc_train_NBClassifier, acc_test_NBClassifier)
With Naive Bayes, the cross-validation accuracy on the training set is 69.875%, and the model took about 6 seconds to run. After predicting on the testing set, 78.901% of the instances were correctly classified.
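Gaussian Naive Bayes is this fast because fitting only estimates a per-class mean and variance for each feature; that Gaussian assumption fits the skewed URL count features poorly, which is consistent with the lower accuracy. A minimal sketch of the mechanics on toy data:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# two classes with well-separated feature means
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.3], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

nb = GaussianNB().fit(X, y)  # fitting is just per-class mean/variance estimation
print(nb.theta_.ravel())     # per-class feature means, roughly [1.03, 5.03]

assert nb.predict([[1.1]])[0] == 0
assert nb.predict([[5.1]])[0] == 1
```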
#from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_validate
start = time.time()
SGDC = SGDClassifier()
SGDC.fit(x_train,y_train)
scores_SGDC = cross_validate(
    SGDC, x_train, y_train,
    cv = 5, scoring = custom_scorer, n_jobs=-1)
for name in scores_SGDC.keys():
    average_SGDC = np.average(scores_SGDC[name])
    print('%s: %.5f' %(name,average_SGDC))
stop = time.time()
training_time_SGDC = stop - start
print(f"Training_time: {training_time_SGDC} seconds")
fit_time: 5.08507
score_time: 0.77938
test_accuracy: 0.85894
test_precision: 0.77922
test_recall: 0.74578
Training_time: 13.368785858154297 seconds
#confusion matrix
#comparison graphs
accuracy_SGDClassifier = []
precision_SGDClassifier = []
recall_SGDClassifier = []
for k, v in scores_SGDC.items():
    if k == 'test_accuracy':
        accuracy_SGDClassifier.append(v.mean())
    if k == 'test_precision':
        precision_SGDClassifier.append(v.mean())
    if k == 'test_recall':
        recall_SGDClassifier.append(v.mean())
accuracy_SGDClassifier
[0.8589436893005405]
precision_SGDClassifier
[0.7792184803287364]
recall_SGDClassifier
[0.745780139428034]
#Creating a classification report with all the scores
predictions_SGDClassifier = SGDC.predict(x_test)
print(classification_report(y_test,predictions_SGDClassifier))
#Extracting only precision and recall scores to be used later on
predict_precision_SGDClassifier,predict_recall_SGDClassifier,fscore,support=score(y_test,predictions_SGDClassifier,average='macro')
print('Precision : {:.3f}'.format(predict_precision_SGDClassifier))
print('Recall : {:.3f}'.format(predict_recall_SGDClassifier))
precision recall f1-score support
benign 0.89 0.96 0.93 85645
defacement 0.85 0.92 0.88 19274
malware 0.62 0.74 0.68 6535
phishing 0.74 0.36 0.49 18789
accuracy 0.86 130243
macro avg 0.78 0.74 0.74 130243
weighted avg 0.85 0.86 0.84 130243
Precision : 0.775
Recall : 0.745
cm7 = confusion_matrix(y_test, predictions_SGDClassifier)
print(cm7)
predict_accuracy_SGDClassifier = sum(np.diag(cm7))/sum(sum(cm7))*100
print('{:.3f} accurately classified after prediction on testing set'.format(predict_accuracy_SGDClassifier))
[[82404   428   501  2312]
 [  867 17639   766     2]
 [  686   916  4816   117]
 [ 8484  1823  1641  6841]]
85.763 accurately classified after prediction on testing set
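The accuracy computed above comes straight from the confusion matrix: the diagonal holds the correctly classified samples, so accuracy is the trace divided by the total. A minimal sketch with toy labels (the label values below are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Toy ground truth and predictions (hypothetical, for illustration only)
y_true = ["benign", "benign", "phishing", "malware", "phishing", "benign"]
y_pred = ["benign", "phishing", "phishing", "malware", "benign", "benign"]

cm = confusion_matrix(y_true, y_pred)
# Accuracy = correctly classified (diagonal entries) / all samples
acc_from_cm = np.trace(cm) / cm.sum() * 100
print('{:.3f}% accurately classified'.format(acc_from_cm))

# Cross-check against sklearn's accuracy_score
assert np.isclose(acc_from_cm, accuracy_score(y_true, y_pred) * 100)
```

`np.trace(cm)` is equivalent to the `sum(np.diag(cm))` used in the notebook; both sum the diagonal.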
#predicting the target value from the model for the samples
y_test_SGDC = SGDC.predict(x_test)
y_train_SGDC = SGDC.predict(x_train)
#computing the accuracy of the model performance
acc_train_SGDC = accuracy_score(y_train,y_train_SGDC)
acc_test_SGDC = accuracy_score(y_test,y_test_SGDC)
print("SGD Classifier: Accuracy on training Data: {:.3f}".format(acc_train_SGDC))
print("SGD Classifier: Accuracy on testing Data: {:.3f}".format(acc_test_SGDC))
SGD Classifier: Accuracy on training Data: 0.858
SGD Classifier: Accuracy on testing Data: 0.858
cm_SGDClassifier = metrics.confusion_matrix(y_test, predictions_SGDClassifier)
plt.figure(figsize=(10,10))
sns.heatmap(cm_SGDClassifier, annot=True, fmt=".5f", linewidths=.5,
square = True, cmap = 'PiYG');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
title_SGDClassifier = 'Accuracy Score: {0}'.format(predict_accuracy_SGDClassifier)
plt.title(title_SGDClassifier, size = 20);
#storing the results. The below mentioned order of parameter passing is important.
#Caution: Execute only once to avoid duplications.
storeResults('Stochastic Gradient Descent Classifiers', acc_train_SGDC, acc_test_SGDC)
For the Stochastic Gradient Descent classifier, the cross-validated accuracy on the training set is 85.894%, and the model took about 13 seconds to train. After predicting on the testing set, 85.763% of the samples were correctly classified.
Now that all the training results are available, it's time to put them together, visualize them, and compare the models.
accuracies = [['logisticReg', accuracy_logisticReg],
['decisionTree', accuracy_decisionTree],
['RFClassifier', accuracy_RFClassifier],
['MLPClassifier', accuracy_MLPClassifier],
['GradientBoostingClassifier', accuracy_GradientBoostingClassifier],
['GaussianNB', accuracy_GaussianNB],
['SGDClassifier', accuracy_SGDClassifier]]
precision =[precision_logisticReg,
precision_decisionTree,
precision_RFClassifier,
precision_MLPClassifier,
precision_GradientBoostingClassifier,
precision_GaussianNB,
precision_SGDClassifier]
recall = [recall_logisticReg,
recall_decisionTree,
recall_RFClassifier,
recall_MLPClassifier,
recall_GradientBoostingClassifier,
recall_GaussianNB,
recall_SGDClassifier]
df_comparison_scores = pd.DataFrame (accuracies, columns = ['model', 'accuracy'])
df_comparison_scores['precision'] = precision
df_comparison_scores['recall'] = recall
df_comparison_scores["accuracy"] = df_comparison_scores["accuracy"].str.get(0)
df_comparison_scores["precision"] = df_comparison_scores["precision"].str.get(0)
df_comparison_scores["recall"] = df_comparison_scores["recall"].str.get(0)
df_comparison_scores
| | model | accuracy | precision | recall |
|---|---|---|---|---|
| 0 | logisticReg | 0.867100 | 0.795241 | 0.748294 |
| 1 | decisionTree | 0.918159 | 0.911728 | 0.865789 |
| 2 | RFClassifier | 0.922397 | 0.922806 | 0.871873 |
| 3 | MLPClassifier | 0.912152 | 0.903841 | 0.843239 |
| 4 | GradientBoostingClassifier | 0.898834 | 0.883738 | 0.807069 |
| 5 | GaussianNB | 0.698748 | 0.626185 | 0.598122 |
| 6 | SGDClassifier | 0.858944 | 0.779218 | 0.745780 |
Based on the table shown above, the Random Forest model achieves the highest accuracy of all the models, at 92.24%.
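The `.str.get(0)` calls above unwrap the single-element lists produced by the score-collection loops into plain scalars. A minimal sketch with hypothetical values:

```python
import pandas as pd

# A column of single-element lists, as produced by the score-collection
# loops earlier in this report (the values here are illustrative)
df = pd.DataFrame({'model': ['A', 'B'], 'accuracy': [[0.92], [0.86]]})

# Series.str.get(0) indexes into each list element, unwrapping it to a scalar
df['accuracy'] = df['accuracy'].str.get(0)
print(df['accuracy'].tolist())  # → [0.92, 0.86]
```

`Series.str.get` works element-wise on lists, tuples, and dicts, not only strings, which is why it can unwrap list-valued cells here.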
df_comparison_scores = pd.melt(df_comparison_scores, id_vars="model", var_name="score_names", value_name="scores")
df_comparison_scores
| | model | score_names | scores |
|---|---|---|---|
| 0 | logisticReg | accuracy | 0.867100 |
| 1 | decisionTree | accuracy | 0.918159 |
| 2 | RFClassifier | accuracy | 0.922397 |
| 3 | MLPClassifier | accuracy | 0.912152 |
| 4 | GradientBoostingClassifier | accuracy | 0.898834 |
| 5 | GaussianNB | accuracy | 0.698748 |
| 6 | SGDClassifier | accuracy | 0.858944 |
| 7 | logisticReg | precision | 0.795241 |
| 8 | decisionTree | precision | 0.911728 |
| 9 | RFClassifier | precision | 0.922806 |
| 10 | MLPClassifier | precision | 0.903841 |
| 11 | GradientBoostingClassifier | precision | 0.883738 |
| 12 | GaussianNB | precision | 0.626185 |
| 13 | SGDClassifier | precision | 0.779218 |
| 14 | logisticReg | recall | 0.748294 |
| 15 | decisionTree | recall | 0.865789 |
| 16 | RFClassifier | recall | 0.871873 |
| 17 | MLPClassifier | recall | 0.843239 |
| 18 | GradientBoostingClassifier | recall | 0.807069 |
| 19 | GaussianNB | recall | 0.598122 |
| 20 | SGDClassifier | recall | 0.745780 |
From the table above, Random Forest also has the highest accuracy, precision, and recall scores among all the models.
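The `pd.melt` call above reshapes the wide comparison table into long form so seaborn can plot one bar per (model, metric) pair. A minimal sketch of the same reshape on a two-model, two-metric table (values are illustrative):

```python
import pandas as pd

# A small wide-format table mirroring df_comparison_scores
wide = pd.DataFrame({
    'model': ['RFClassifier', 'GaussianNB'],
    'accuracy': [0.92, 0.70],
    'precision': [0.92, 0.63],
})

# melt turns every metric column into rows of (model, score_names, scores)
long = pd.melt(wide, id_vars='model',
               var_name='score_names', value_name='scores')
print(long)
```

Each of the 2 models contributes one row per metric, so the long table has 2 × 2 = 4 rows, matching the 7 × 3 = 21 rows of the real comparison table.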
ax = sns.catplot(x='model', y='scores', hue='score_names', palette='rainbow',
                 height=4.5, aspect=2.5,
                 data=df_comparison_scores, kind='bar',
                 order=['RFClassifier', 'decisionTree', 'MLPClassifier',
                        'GradientBoostingClassifier', 'logisticReg',
                        'SGDClassifier', 'GaussianNB'])
_ = ax.fig.suptitle("The training scores of the different classification models",
                    fontsize=20, fontweight="bold")
_ = ax.set_axis_labels(x_var="Models", y_var="Scores")
The bar chart above is ordered in descending order of accuracy.
#training times
training_time = [['logisticReg', training_time_logistic],
['decisionTree', training_time_decision],
['RFClassifier', training_time_rf],
['MLPClassifier',training_time_MLPC],
['GradientBoostingClassifier', training_time_GB],
['GaussianNB',training_time_NB],
['SGDClassifier', training_time_SGDC]]
df_comparison_times = pd.DataFrame (training_time,
columns = ['model', 'training_time'])
df_comparison_times
| | model | training_time |
|---|---|---|
| 0 | logisticReg | 32.979476 |
| 1 | decisionTree | 18.339084 |
| 2 | RFClassifier | 138.483790 |
| 3 | MLPClassifier | 556.154083 |
| 4 | GradientBoostingClassifier | 945.658388 |
| 5 | GaussianNB | 5.953403 |
| 6 | SGDClassifier | 13.368786 |
Another way to compare the models is by their training time. The table shows that Naive Bayes took the shortest time to train, whereas the Gradient Boosting Classifier took the longest.
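The per-model times above were all collected with the same wall-clock pattern used throughout this report. A minimal sketch, with a toy workload standing in for `classifier.fit(x_train, y_train)`:

```python
import time

# Wall-clock timing pattern used for every model in this report;
# the sum below is a toy stand-in for the actual fit() call
start = time.time()
total = sum(i * i for i in range(100_000))
stop = time.time()
training_time = stop - start
print(f"Training_time: {training_time} seconds")
```

Note this measures wall-clock time, so background load on the machine inflates the numbers; it is fine for the rough comparison made here.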
def show_values(axs, orient="v", space=.01):
    def _single(ax):
        if orient == "v":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() / 2
                _y = p.get_y() + p.get_height() + (p.get_height()*0.01)
                value = '{:.3f}'.format(p.get_height())
                ax.text(_x, _y, value, ha="center")
        elif orient == "h":
            for p in ax.patches:
                _x = p.get_x() + p.get_width() + float(space)
                _y = p.get_y() + p.get_height() - (p.get_height()*0.5)
                value = '{:.4f}'.format(p.get_width())
                ax.text(_x, _y, value, ha="left")
    if isinstance(axs, np.ndarray):
        for idx, ax in np.ndenumerate(axs):
            _single(ax)
    else:
        _single(axs)
ax = sns.barplot(x="model", y="training_time", data=df_comparison_times, palette='Set1',
order=df_comparison_times.sort_values('training_time', ascending = False).model)
_ = ax.set_xlabel("Machine Learning Models")
_ = ax.set_ylabel("Time (in seconds)")
_ = ax.set_title("Training Time of the Models")
show_values(ax)
from sklearn.model_selection import RandomizedSearchCV
#No. of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 300, num = 10)]
maximum_features = [12,16,21] ## no. features to consider at every split
#The function to measure the quality of a split.
#Supported criteria are "gini" for the Gini impurity, "log_loss" and "entropy" both for the Shannon information gain
criterion=['entropy','gini']
# Maximum no.of levels in tree
#If None, then nodes are expanded until all leaves are pure
#or until all leaves contain less than min_samples_split samples.
maximum_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
maximum_depth.append(None)
#Method to choose the samples for training each of the tree
#Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': maximum_features,
'max_depth': maximum_depth,
'criterion': criterion,
'bootstrap':bootstrap}
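The `n_estimators` candidates above come from `np.linspace`, which produces evenly spaced floats; casting each to `int` yields the integer tree counts that appear in the search output below. A quick sketch of just that step:

```python
import numpy as np

# linspace gives 10 evenly spaced floats from 100 to 300 inclusive;
# int() truncates each one to the candidate tree counts for the grid
n_estimators = [int(x) for x in np.linspace(start=100, stop=300, num=10)]
print(n_estimators)  # → [100, 122, 144, 166, 188, 211, 233, 255, 277, 300]
```

The same pattern generates the `max_depth` candidates (10 through 110 in steps of 10, plus `None` for unlimited depth).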
#Use the random_grid created to search for best hyperparameters
#the best model among all the other ones
RFClassifier
#Apply the RandomizedSearchCV
RFClassifier_random_new = RandomizedSearchCV(estimator = RFClassifier,
param_distributions = random_grid,
n_iter = 10, cv = 3, verbose=2, random_state=13, n_jobs = -1)
# Fit the random search model
RFClassifier_random_new.fit(x_train, y_train)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
RandomizedSearchCV(cv=3, estimator=RandomForestClassifier(random_state=13),
n_jobs=-1,
param_distributions={'bootstrap': [True, False],
'criterion': ['entropy', 'gini'],
'max_depth': [10, 20, 30, 40, 50, 60,
70, 80, 90, 100, 110,
None],
'max_features': [12, 16, 21],
'n_estimators': [100, 122, 144, 166,
188, 211, 233, 255,
277, 300]},
random_state=13, verbose=2)
RFClassifier_random_new.best_params_
{'n_estimators': 277,
'max_features': 12,
'max_depth': 100,
'criterion': 'gini',
'bootstrap': True}
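The search pattern above can be sketched end-to-end on a tiny synthetic dataset (the dataset and the small parameter ranges below are stand-ins, not the ones used in this report):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Tiny synthetic stand-in for the URL feature matrix
X, y = make_classification(n_samples=200, n_features=8, random_state=13)

# Small illustrative grid; the real search uses far larger ranges
param_distributions = {'n_estimators': [10, 25, 50],
                       'max_depth': [3, 5, None]}

# Sample 4 random combinations, score each with 3-fold CV
search = RandomizedSearchCV(RandomForestClassifier(random_state=13),
                            param_distributions, n_iter=4, cv=3,
                            random_state=13, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)     # best sampled combination by mean CV score
print(search.best_estimator_)  # that model, refit on all of X, y
```

Unlike `GridSearchCV`, which tries every combination, `RandomizedSearchCV` samples only `n_iter` of them, which is why it stays tractable on large grids like the one above.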
#the accuracy of the basic RFClassifier
print("Accuracy of the basic RFClassifier training model: {}%".format(accuracy_RFClassifier))
print("Accuracy of the basic RFClassifier predicted on testing model: {:.3f}%".format(predict_accuracy_RFClassifier))
Accuracy of the basic RFClassifier training model: [0.9223972419587643]%
Accuracy of the basic RFClassifier predicted on testing model: 92.300%
best_random_grid = RFClassifier_random_new.best_estimator_
best_random_grid
RandomForestClassifier(max_depth=100, max_features=12, n_estimators=277,
random_state=13)
start = time.time()
best_random_RFClassifier = RandomForestClassifier(bootstrap=False,
max_depth=70,
max_features=12,
n_estimators=277, random_state=13,
n_jobs=-1).fit(x_train,y_train)
best_random_accuracy_rf = best_random_RFClassifier.score(x_train,y_train)
print("The Accuracy of the fine-tuned RFClassifier on training model: {:.5f}%".format(best_random_accuracy_rf*100))
predictions_random_rf = best_random_RFClassifier.predict(x_test)
best_random_accuracy_predict_rf = accuracy_score(y_test, predictions_random_rf)
print("The Accuracy of the fine-tuned RFClassifier on testing model: {:.5f}%".format(best_random_accuracy_predict_rf*100))
stop = time.time()
training_time_random_rf = stop - start
print("Training_time: {:.3f} seconds".format(training_time_random_rf))
The Accuracy of the fine-tuned RFClassifier on training model: 93.69073%
The Accuracy of the fine-tuned RFClassifier on testing model: 92.24219%
Training_time: 157.590 seconds
cm1_best_random_rf = metrics.confusion_matrix(y_test, predictions_random_rf)
plt.figure(figsize=(10,10))
sns.heatmap(cm1_best_random_rf, annot=True, fmt=".3f", linewidths=.5, square = True, cmap = 'Blues_r');
plt.ylabel('Actual label');
plt.xlabel('Predicted label');
title_best_random_rf = 'Accuracy Score: {:.5f} %'.format(best_random_accuracy_predict_rf*100)
plt.title(title_best_random_rf, size = 20);
from sklearn.ensemble import VotingClassifier
classifier1 = RFClassifier
classifier2 = decisionTree
classifier3 = MultiClassifier
vclassifier = VotingClassifier(
estimators=[('rf', RFClassifier), ('dt', decisionTree), ('mlp', MultiClassifier)],
voting='hard')
for classifier, label in zip([classifier1, classifier2, classifier3, vclassifier],
                             ['Random Forest', 'Decision Tree', 'Multi-layer perceptrons', 'Ensemble']):
    scores = cross_val_score(classifier, x_train, y_train, scoring='accuracy', cv=5)
    print("Accuracy: %0.2f [%s]" % (scores.mean(), label))
Accuracy: 0.92 [Random Forest]
Accuracy: 0.92 [Decision Tree]
Accuracy: 0.91 [Multi-layer perceptrons]
Accuracy: 0.92 [Ensemble]
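With `voting='hard'`, each base model casts one vote per sample and the majority class wins. A self-contained sketch on a synthetic dataset (the data and the three base models below are stand-ins for the tuned classifiers used above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the URL feature matrix
X, y = make_classification(n_samples=200, n_features=8, random_state=13)

# Hard voting: each estimator predicts a class label; the most
# common label across the three estimators becomes the prediction
vote = VotingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=25, random_state=13)),
                ('dt', DecisionTreeClassifier(random_state=13)),
                ('lr', LogisticRegression(max_iter=1000))],
    voting='hard')
vote.fit(X, y)
print(vote.score(X, y))
```

The alternative, `voting='soft'`, averages the estimators' predicted class probabilities instead of their labels, which can help when the base models output well-calibrated probabilities.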
Comparing the basic models to the RandomizedSearchCV-tuned model, we can see that the tuned model outperformed or was comparable to the basic model.
#pip install tabulate
from tabulate import tabulate
The_training_accuracy_table = tabulate([
['accuracy_RFClassifier', accuracy_RFClassifier],
['best_random_accuracy_rf', best_random_accuracy_rf]],
headers=['Training Model Name', 'Accuracy'])
print(The_training_accuracy_table)
Training Model Name      Accuracy
-----------------------  --------------------
accuracy_RFClassifier    [0.9223972419587643]
best_random_accuracy_rf  0.936907330482121
The_prediction_accuracy_table = tabulate([
['predict_accuracy_RFClassifier', predict_accuracy_RFClassifier],
['best_random_accuracy_predict_rf', best_random_accuracy_predict_rf*100]],
headers=['Predicted Model Name', 'Accuracy'])
print(The_prediction_accuracy_table)
Predicted Model Name             Accuracy
-------------------------------  ----------
predict_accuracy_RFClassifier       92.2998
best_random_accuracy_predict_rf     92.2422
When all three models are combined, the ensemble performs admirably.
The_accuracy_of_basic_ensemble = tabulate([
['accuracy_Random_Forest', 0.93],
['accuracy_Decision_Tree', 0.92],
['accuracy_MLP', 0.92],
['accuracy_Ensemble', 0.93],],
headers=['Model', 'Accuracy'])
print(The_accuracy_of_basic_ensemble)
Model                     Accuracy
----------------------  ----------
accuracy_Random_Forest        0.93
accuracy_Decision_Tree        0.92
accuracy_MLP                  0.92
accuracy_Ensemble             0.93
This can be considered a success, as the Random Forest Classifier achieved an overall accuracy score of 93%.